Within and across sentence boundary language model

Authors

  • Saeedeh Momtazi
  • Friedrich Faubel
  • Dietrich Klakow
Abstract

In this paper, we propose two language modeling approaches, skip trigram and across sentence boundary, to capture long-range dependencies. The skip trigram model covers more predecessor words of the current word than the normal trigram while requiring the same memory space. The across sentence boundary model uses the word distribution of the previous sentences to calculate the unigram probability, which is applied as the emission probability in the word and class model frameworks. Our experiments on the Penn Treebank [1] show that each of the proposed models, as well as their combination, significantly outperforms the baseline for the word model, the class model, and their linear interpolation. The linear interpolation of the word and class models with the proposed skip trigram and across sentence boundary models achieves a perplexity of 118.4, while the best state-of-the-art language model has a perplexity of 137.2 on the same dataset.
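
The abstract reports results for linearly interpolated models, measured in perplexity. The following is a minimal sketch of those two standard operations only, not the authors' implementation: the function names, the interpolation weight of 0.5, and the per-word probabilities are all hypothetical values chosen for illustration.

```python
import math


def interpolate(p_a, p_b, lam):
    """Linearly interpolate two probabilities assigned to the same word.

    lam is the weight of the first model; (1 - lam) goes to the second.
    """
    return lam * p_a + (1.0 - lam) * p_b


def perplexity(probs):
    """Perplexity of a test sequence, given the probability the model
    assigned to each word: exp of the negative mean log-probability."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


# Hypothetical per-word probabilities from two models on four test words
# (illustrative numbers, not taken from the paper).
word_model = [0.10, 0.02, 0.30, 0.05]
class_model = [0.08, 0.04, 0.20, 0.10]

# Interpolated model with an assumed weight of 0.5 for each component.
mixed = [interpolate(a, b, 0.5) for a, b in zip(word_model, class_model)]

print("word model perplexity: ", perplexity(word_model))
print("class model perplexity:", perplexity(class_model))
print("interpolated perplexity:", perplexity(mixed))
```

In practice the interpolation weight would be tuned on held-out data rather than fixed by hand, and lower perplexity indicates a better fit to the test text.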

Similar resources

A study in machine learning from imbalanced data for sentence boundary detection in speech

Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody mod...

An Automatic Sentence Bou Based on a Structured La

In this paper we describe an automatic sentence boundary detector, which inserts a period (sentence boundary marker) into a word sequence output by a speech recognizer. The state-of-the-art automatic sentence boundary detectors insert a period at a position selected by a word trigram model from among candidates (long pauses) offered by an acoustic model. In contrast, the automatic sentence bound...

The consistency of sentence intelligibility across three types of signal distortion.

PURPOSE To examine the extent to which sentences retain their levels of spoken intelligibility relative to other sentences in a set (the sentence effect) across different types of signal distortion. METHOD The Central Institute for the Deaf (CID) sentences were rendered difficult to understand through the addition of broadband noise. These intelligibility data were compared with those from pr...

An Investigation on the Relationship between the Grammatical Competence of Young Iranian English Translation Students and their Ability to Translate from English to Farsi

Today, everything has changed and this has brought a need for learning a second language. Most countries across the world use English as their second/foreign language and the fundamental part of this process is grammar, i.e., the combination of sound, structure, and meaning system of language. A sentence can be composed of several words, clauses, as well as grammatical rules. These grammat...

A hybrid approach for Urdu sentence boundary disambiguation

Sentence boundary identification is a preliminary step for preparing a text document for Natural Language Processing tasks, e.g., machine translation, POS tagging, and text summarization. We present a hybrid approach for Urdu sentence boundary disambiguation comprising a unigram statistical model and a rule-based algorithm. After implementing this approach, we obtained 99.48% precision, 86.3...

Publication date: 2010